This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining, a TWS determines how documents are represented in a vector space model before a classifier is applied. Although acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), defining TWSs has traditionally been an art. Further, it is still difficult to determine which TWS is best for a particular problem, and it is not yet clear whether schemes better than those currently available can be generated by combining known TWSs. In this article we propose a genetic program that aims at learning effective TWSs that improve on the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.
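To make the idea of combining basic units concrete, the following minimal sketch represents a candidate TWS as a small expression tree and evaluates it on a toy document-term count matrix. It is an illustration only: the abstract does not list the actual basic units or operators, so the units `tf` and `idf`, the operators `add` and `mul`, and the tuple-based tree encoding are all assumptions, and the genetic search over such trees is not shown.

```python
import math
import operator

# Assumed basic units (hypothetical; not taken from the paper).
# Each maps a document-term count matrix to a matrix of the same shape.
def tf(counts):
    # Raw term frequency: a copy of the counts.
    return [row[:] for row in counts]

def idf(counts):
    # Inverse document frequency, log(N / df), repeated for every document.
    n_docs = len(counts)
    n_terms = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    weights = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [weights[:] for _ in counts]

UNITS = {"tf": tf, "idf": idf}
OPS = {"add": operator.add, "mul": operator.mul}  # assumed operator set

def evaluate(tree, counts):
    """Evaluate a TWS expression tree element-wise.

    A tree is either a unit name (str) or a tuple (op, left, right).
    """
    if isinstance(tree, str):
        return UNITS[tree](counts)
    op, left, right = tree
    a = evaluate(left, counts)
    b = evaluate(right, counts)
    return [[OPS[op](x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy corpus: 2 documents (rows) x 2 terms (columns).
counts = [[2, 0],
          [1, 1]]

# One candidate scheme: tf * idf, written as a tree.
candidate = ("mul", "tf", "idf")
weights = evaluate(candidate, counts)
```

In a genetic-programming setting, trees like `candidate` would be the individuals in the population, and each would be scored by the classification performance obtained with the document representation it produces.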